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Abstract 

Recent methods for estimating sparse undirected graphs for real-valued data in 
high dimensional problems rely heavily on the assumption of normality. We show 
how to use a semiparametric Gaussian copula — or "nonparanormal" — for high di- 
mensional inference. Just as additive models extend linear models by replacing 
linear functions with a set of one-dimensional smooth functions, the nonparanor- 
mal extends the normal by transforming the variables by smooth functions. We 
derive a method for estimating the nonparanormal, study the method's theoret- 
ical properties, and show that it works well in many examples. 

Keywords: Graphical models, Gaussian copula, high dimensional inference, 
sparsity, l\ regularization, graphical lasso, paranormal, occult 
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I. Introduction 



The linear model is a mainstay of statistical inference that has been extended in several im- 
portant ways. An extension to high dimensions was achieved by adding a sparsity constraint, 
leading to the lasso (Tibshirani, 1996). An extension to nonparametric models was achieved 
by replacing linear functions with smooth functions, leading to additive models (Hastie and 
Tibshirani, 1999). These two ideas were recently combined, leading to an extension called 
sparse additive models (SpAM) (Ravikumar et al., 2008b, a). In this paper we consider a 
similar nonparametric extension of undirected graphical models based on multivariate Gaus- 
sian distributions in the high dimensional setting. Specifically, we use a high dimensional 
Gaussian copula with nonparametric marginals, which we refer to as a nonparanormal dis- 
tribution. 

If X is a p-dimensional random vector distributed according to a multivariate Gaussian 
distribution with covariance matrix S, the conditional independence relations between the 
random variables Xi,X 2 , . . . ,X p are encoded in a graph formed from the precision matrix 
Q = Specifically, missing edges in the graph correspond to zeroes of Q. To estimate the 
graph from a sample of size n, it is only necessary to estimate S, which is easy if n is much 
larger than p. However, when p is larger than n, the problem is more challenging. Recent 
work has focused on the problem of estimating the graph in this high dimensional setting, 
which becomes feasible if G is sparse. Yuan and Lin (2007) and Banerjee et al. (2008) 
propose an estimator based on regularized maximum likelihood using an l\ constraint on 
the entries of f2, and Friedman et al. (2007) develop an efficient algorithm for computing 
the estimator using a graphical version of the lasso. The resulting estimation procedure has 
excellent theoretical properties, as shown recently by Rothman et al. (2008) and Ravikumar 
et al. (2009). 

While Gaussian graphical models can be useful, a reliance on exact normality is limiting. 
Our goal in this paper is to weaken this assumption. Our approach parallels the ideas behind 
sparse additive models for regression (Ravikumar et al., 2008b, a). Specifically, we replace the 
Gaussian with a semiparametric Gaussian copula. This means that we replace the random 
variable X = (Xi, . . . , X p ) by the transformed random variable f(X) = (f\(Xi), . . . , f p (X p )), 
and assume that f(X) is multivariate Gaussian. This semiparametric copula results in a 
nonparametric extension of the normal that we call the nonparanormal distribution. The 
nonparanormal depends on the functions {fj}, and a mean /i and covariance matrix S, all 
of which are to be estimated from data. While the resulting family of distributions is much 
richer than the standard parametric normal (the paranormal), the independence relations 
among the variables are still encoded in the precision matrix Q = We propose a 

nonparametric estimator for the functions {fj}, and show how the graphical lasso can be 
used to estimate the graph in the high dimensional setting. The relationship between linear 
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Figure 1: Comparison of regression and graphical models. The nonparanormal extends 
additive models to the graphical model setting. Regularizing the inverse covariance leads to 
an extension to high dimensions, which parallels sparse additive models for regression. 

regression models, Gaussian graphical models, and their extensions to nonparametric and 
high dimensional models is summarized in Figure 1. 

Most theoretical results on semiparametric copulas focus on low or at least finite dimen- 
sional models (Tsukahara, 2005). Models with increasing dimension require a more delicate 
analysis; in particular, simply plugging in the usual empirical distribution of the marginals 
does not lead to accurate inference. Instead we use a truncated empirical distribution. We 
give a theoretical analysis of this estimator, proving consistency results with respect to risk, 
model selection, and estimation of Q in the Frobenius norm. 

In the following section we review the basic notion of the graph corresponding to a multi- 
variate Gaussian, and formulate different criteria for evaluating estimators of the covariance 
or inverse covariance. In Section 3 we present the nonparanormal, and in Section 4 we dis- 
cuss estimation of the model. We present a theoretical analysis of the estimation method 
in Section 5, with the detailed proofs collected in an appendix. In Section 6 we present 
experiments with both simulated data and gene microarray data, where the problem is to 
construct the isoprenoid biosynthetic pathway. 



II. Estimating Undirected Graphs 

Let X = (Xi, . . . ,X p ) denote a random vector with distribution P = N(fi, E). The undi- 
rected graph G = (V, E) corresponding to P consists of a vertex set V and an edge set 
E. The set V has p elements, one for each component of X. The edge set E consists of 
ordered pairs where € E if there is a edge between X; L and Xj. The edge between 
is excluded from E if and only if X t is independent of Xj given the other variables 
0\{i d y = (X s : 1 < s < p, s ^ written 



Xi II Xj 



0\ {i , j} . (2.1) 
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It is well-known that, for multivariate Gaussian distributions, (2.1) holds if and only if 



fijj = where Q = E 1 . 



Let lW,l( 2 ', . . be a random sample from P, where X« G M p . If n is much larger 

than p, then we can estimate £ using maximum likelihood, leading to the estimate Q = S -1 , 
where 

s = -J2(x (l) -x) (x«-X) T 

1=1 

is the sample covariance, with X the sample mean. The zeroes of Q can then be estimated 
by applying hypothesis testing to Q (Drton and Perlman, 2007, 2008). 

When p > n, maximum likelihood is no longer useful; in particular, the estimate £ is not 
positive definite, having rank no greater than n. Inspired by the success of the lasso for 
linear models, several authors have suggested estimating E by minimizing 

-£(li,n)+\J2\^jk\ (2.2) 

where 

1(H, n) = - (log \n\- tr(nS) - p log(27r)) (2.3) 

is the average log-likelihood and S is the sample covariance matrix. The estimator f2 can 
be computed efficiently using the glasso algorithm (Friedman et al., 2007), which is a block 
coordinate descent algorithm that uses the standard lasso to estimate a single row and column 
of Q in each iteration. Under appropriate sparsity conditions, the resulting estimator Q has 
been shown to have good theoretical properties (Rothman et al., 2008; Ravikumar et al., 
2009). 

There are several different ways to judge the quality of an estimator E of the covariance 
or inverse covariance Q. We discuss three in this paper, persistency, norm consistency, and 
sparsistency. Persistency means consistency in risk, when the model is not assumed to be 
correct. Suppose the true distribution is P has mean /i , and that we use a multivariate 
normal p(x; fiQ, E) for prediction. We do not assume that P is normal. We observe a new 
vector X ~ P and define the prediction risk to be 

i?(£) = -Elogp(X ;/U o,S) = - J logp(x;/x ,E) dP{x). 

It follows that 

it>(£) = - (tr(E-%) +log|E| -plog(27r)) 

where E is the covariance of X under P. If S is a set of covariance matrices, the oracle is 
defined to be the covariance matrix £* that minimizes -R(E) over S: 

£* = argmin Se5 i?(E). 
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Thus p(x; /i , £*) is the best predictor of a new observation among all distributions in 
{p(x; Ho, £) : £ G 5}. In particular, if 5 consists of covariance matrices with sparse graphs, 
then p(x; /i , £*) is, in some sense, the best sparse predictor. An estimator £ n is persistent 
if 

i?(£„) - i2(E„) 4 

as the sample size n increases to infinity. Thus, a persistent estimator approximates the best 
estimator over the class S, but we do not assume that the true distribution has a covariance 
matrix in S, or even that it is Gaussian. Moreover, we allow the dimension p = p n to 
increase with n. On the other hand, norm consistency and sparsistency require that the 
true distribution is Gaussian. In this case, let S denote the true covariance matrix. An 
estimator is norm consistent if 

||£ n -£|| ^0 

where || • || is a norm. If denotes the edge set corresponding to fi. An estimator is 

sparsistent if 

w(E(n)^E(nS) ->o. 

Thus, a sparsistent estimator identifies the correct graph consistently. We summarize known 
results on these properties for the multivariate normal in Section 5, before presenting our 
theoretical analysis of the nonparanormal. 



III. The Nonparanormal 

We say that a random vector X = (X 1 , . . . , X P ) T has a nonparanormal distribution if there 
exist functions {fj} p j=1 such that Z = f(X) ~ N((i, E), where f(X) = (/i(Xi), . . . , / P (X P )). 
We then write 

X~NPN(n,V,f). 

When the f/s are monotone and differentiable, the joint probability density function of X 
is given by 

Lemma 3.1. Tiie nonparanormal distribution NPN(n, E, /) is a Gaussian copula when 
the fj 's are monotone and differentiable. 

Proof. By Sklar's theorem (Sklar, 1959), any joint distribution can be written as 

...,x p ) = C{Fi(a:i), . . . , F p (x p )} 
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where the function C is called a copula. For the nonparanormal we have 

F( Xl , ...,x p ) = ^ iE ($- 1 (F 1 ( a : 1 )), . . . , $- 1 (F p (x p ))) 

where is the multivariate Gaussian cdf and $ is the univariate standard Gaussian cdf. 
Thus, the corresponding copula is 

C( Ul ,...,u p ) = $ At)S ($- 1 (M 1 ),...,<l>- 1 (M p )). 

This is exactly a Gaussian copula with parameters ji and E. If each fj is differentiable then 
the density of X has the same form as (3.1). □ 

Note that the density in (3.1) is not identifiable; to make the family identifiable we demand 
that fj preserve means and variances: 

Hj = E(Zj) = E(Xj) and a) = % = Var (Zj) = Var (Xj) . (3.2) 

Note that these conditions only depend on diag(S) but not the full covariance matrix. 
Let Fj(x) denote the marginal distribution function of Xj. Then 

Fj(x) = FiX, < x) =F(Z J < fjix)) = $ 

which implies that 

f j {x)=n j + (Tj$- 1 (Fj(x)). 

The following basic fact says that the independence graph of the nonparanormal is encoded 
in Q = as for the parametric normal. 

Lemma 3.2. If X ~ NPN(/j,,H, f) is nonparanormal and each fj is differentiable, then 
Xi II Xj | 0\{ijy if and only ifVtij = 0, where Q = 

Proof. From the form of the density (3.1), it follows that the density factors with respect 
to the graph of fi, and therefore obeys the global Markov property of the graph. □ 

Next we show that the above is true for any choice of identification restrictions. 
Lemma 3.3. Define 

h j (x) = $- 1 (F j (x)) (3.3) 
and let A be the covariance matrix of h(X). Then Xj II X k \ 0\{j,fe} if and only if Aj£ = 0. 



Figure 2: Densities of three 2- dimensional nonparanormals. The component functions have 
the form fj(x) = siga(x)\x\ aj . Left: ct\ = 0.9, a 2 = 0.8; center: an = 1.2, a 2 = 0.8; right 
ai = 2, a 2 = 3. In each case /i = (0, 0) and E = ( 5 '\ 



Proof. We can rewrite the covariance matrix as 

= Cov(Zj,Z k ) = a j a k Cov(h j (X j ), h k (X k )). 

Hence £ = DAD and 

S- 1 = D^A^D- 1 . 

where D is the diagonal matrix with diag(-D) = a. The zero pattern of A" 1 is therefore 
identical to the zero pattern of E . □ 

Thus, it is not necessary to estimate \i or a to estimate the graph. 

Figure 2 shows three examples of 2-dimensional nonparanormal densities. In each case, the 
component functions fj{x) take the form 

fj(x) = ajsign(x)\x\ aj + bj 
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where the constants aj and bj are set to enforce the identifiability constraints (3.2). The 
covariance in each case is £ = and the mean is /i = (0, 0). The exponent aj determines 
the nonlinearity. It can be seen how the concavity of the density changes with the exponent 
a, and that a > 1 can result in multiple modes. 

The assumption that f(X) = (fi(Xi), . . . , f p (X p ) is normal leads to a semiparametric model 
where only one dimensional functions need to be estimated. But the monotonicity of the 
functions fj, which map onto R, enables computational tractability of the nonparanormal. 
For more general functions /, the normalizing constant for the density 

p x (x) oc exp j-i (f(x) -/x) T E _1 (f(x) 

cannot be computed in closed form. 



IV. Estimation Method 

Let X«, . . .,JW be a sample of size n where X® = (xf, . . . ,X^) T e W . In light of 
(3.3) we define 

where Fj is an estimator of Fj. A natural candidate for Fj is the marginal empirical distri- 
bution function 

1 n 

Now, let 9 denote the parameters of the copula. Tsukahara (2005) suggests taking 9 to be 
the solution of 

n 

J2<P(F 1 (X?),...,F P (X^),9)=0 
1=1 

where is an estimating equation and Fj(t) = nFj(t)/(n + 1). In our case, 9 corresponds to 
the covariance matrix. The resulting estimator 9, called a rank approximate Z-estimator, has 
excellent theoretical properties. However, we are interested in the high dimensional scenario 
where the dimension p is allowed to increase with n; the variance of Fj(t) is too large in this 
case. Instead, we use the following truncated or Winsorized 1 estimator: 

{5 n if Fj{x)<5 n 

Fj(x) if5 n <Fj(x)<l-5 n (4.1) 
(l-5 n ) ii Fj(x) > 1 - 6 n , 

1 After Charles P. Winsor, whom John Tukey credited with converting him from topology to statistics 
(Mallows, 1990). 
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where 5 n is a truncation parameter. Clearly, there is a bias- variance tradeoff in choosing 5 n . 
In what follows we use 

Sn = ; ; (4.2) 

This provides the right balance so that we can achieve the desired rate of convergence in our 
estimate of f2 and the associated undirected graph G. 

Given this estimate of the distribution of variable Xj, we then estimate the transformation 
function fj by 

Jj(x) = 'iij+djh j (x) (4.3) 

where 



hj(x) = fax) 
and jlj and dj are the sample mean and the standard deviation: 



i=l \ i=l 



Now, let S n (f) be the sample covariance matrix of f(X^), . . . , f(X^); that is, 

Sn(7) = iE(7(x«)-^ t (7))(7(x«)-M7)) T (4.4) 

i n ~ 

^(/) = -£/(* (0 )- 
i=i 

We then estimate Q using S n (f). For instance, the maximum likelihood estimator is f2 n = 
Snif) 1 - The ^-regularized estimator is 

fi n = argminjtr (OS n (7)) - log + A||Q||i} (4.5) 

where A is a regularization parameter, and = Ylj^k The estimated graph is then 

E n = {(j,k) : Vtj k 7^ 0}. In the following section we analyze the theoretical properties of 
this ^-regularized estimator. 



V. Theoretical Results 

In this section we present our theoretical results on risk consistency, model selection consis- 
tency, and norm consistency of the covariance S and inverse covariance fl. From Lemma 3.3, 
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the estimate of the graph does not depend on <jj, j G {1, . . . ,p} and //, so we assume that 
Gj = 1 and fi = 0. Our key technical result is an analysis of covariance of the Winsorized 
estimator defined in (4.1), (4.3), and (4.4). In particular, we show that under appropriate 
conditions, 



max 



S n {f)jk S n (f)jk 



Or 



log p log 2 n 



n 



1/2 



where S n (f)jk denotes the (j,k) entry of the matrix. This result allows us to leverage the 
recent analysis of Rothman et al. (2008) and Ravikumar et al. (2009) in the Gaussian case 
to obtain consistency results for the nonparanormal. More precisely, our main theorem is 
the following. 



Theorem 5.1. Suppose that p = rfi and let f he the Winsorized estimator defined in (4.3) 
with 5„ = — — — , Define 



4n 1 / 4 v / 7rlogn 



C(M,0 = 4=e (V2M- l) (M + 2) 



for M,£>0. Then for any e > C(M, {) ' 



log p log n 



n 



1/2 



and sufficiently large n, we have 



P ( max 

jk 



S n {f)jk ~ Sn(f)jk 



Cip 



c 2 p 



>el < ^%? + ^ T + c 3exp 



[ne 



c 4 n 1 / 2 e 2 
log p log 2 n ) 



where Ci, c 2 , c 3 , c 4 are positive constants. 



The proof of the above theorem is given in Section 7. The following corollary is immediate, 
which specifies the scaling of the dimension in terms of sample size. 

Corollary 5.2. Let M > 1 + ^. Then 



P I max 

jk 



S ll (f) J i,--S„(f) J , > C \!.' \I ] ° KP] ^ U | =o(l) + c 3 exp(-c 4 C 2 (M,0). 



n 1 



Hence, 



max 



S n {f)jk S n (f)jk 



} 



log p log n 



n 



1/2 



(5.1) 
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The following corollary yields estimation consistency in both the Frobenius norm and the 
^2-operator norm. The proof follows the same arguments as the proof of Theorem 1 and 
Theorem 2 from Rothman et al. (2008), replacing their Lemma 1 with our Theorem 5.1. 

For a matrix A = (a^), the Frobenius norm || ■ \\ F is defined as ||A|| F = a %- The 

^2-operator norm || • || 2 is defined as the magnitude of the largest eigenvalue of the matrix, 
||A|| 2 = max|| a .|| 2=1 ||Ac|| 2 . In the following, we write a n x b n if there are positive constants 
c and C independent of n such that c < a n /b n < C . 



Corollary 5.3. Suppose that the data are generated as ~ NPN(fx ,T lQ , f ), and let 
fio = Sq 1 . If the regularization parameter \ n is chosen as 



' log p log 2 n 



A 

then the nonparanormal estimator Q n of (4.5) satisfies 

W^n — ^o||f = Op 



s + p) log p log 2 n 



1/2 



s log p log 2 n 



n 



1/2 



n 

and 

ll^ra — ^olh — Op 

where 

s = Card e {1, . . . ,p} x {1, . . . ,p} | no(i,j) + 0, i± j}) 

is the number of nonzero off-diagonal elements of the true precision matrix. 



To prove the model selection consistency result, we need further assumptions. We follow 
Ravikumar (2009) and let the p 2 x p 2 Fisher information matrix of So be T = So <E> So where 
<E> is the Kronecker matrix product, and define the support set S of f2o = S 1 as 

S ee e {1, . . . , P } x {1, . . . ,p} | no(i,j) 0} . 



We use S c to denote the complement of S in the set {1, . . . ,p} x {1, . . . ,p}, and for any two 
subsets T and T' of {1, . . . ,p} x {1, . . . ,p}, we use T TT / to denote the sub-matrix with rows 
and columns of T indexed by T and T' respectively. 
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Assumption 5.4. There exists some a G (0, 1], such that \\T S c S (T ss ) 1 || oc < 1 — a. 

As in Ravikumar et al. (2009), we define two quantities K^ = ||So||oo and K r = || (r^^) — 1 1| oo- 
Further, we define the maximum row degree as 

d = max Card ({j G 1, . . . , p \ Q (i,j) ^ 0}) . 

i=l,-,P 

Assumption 5.5. The quantities K^o and K r are hounded, and there are positive con- 
stants C\ and C 2 such that 



log n o 
min \fl (j,k)\>Ci\l and n > C 2 d logp 

(j,k)es V n 

for large enough n. 

The proof of the following uses our Theorem 5.1 in place of equation (12) in the analysis of 
Ravikumar et al. (2009), 



Corollary 5.6. Suppose the regularization parameter is chosen as 

X„ > 



' log p log 2 n 



n l/2 ■ 

Then the nonparanormal estimator fl n satisfies 

P(0 (fL^o)) > I- o(l) 

where G(£l n , fio) is the event 

{sign (n n (j, k)j = sign (fl^U, k)) , Vj, k G s} . 

Our persistency (risk consistency) result parallels the persistency result for additive models 
given in Ravikumar et al. (2008a), and allows model dimension that grows exponentially 
with sample size. The definition in this theorem uses the fact (from Lemma 7.1) that 
sup,, (Fj(x)j < V2 \ogn when 5 n = l/(4n 1 / 4 ^ logn). 

In the next theorem, we do not assume the true model is nonparanormal and define the 
population and sample risks as 

R(f,Q) = ±{tr[QE(f(X)f(X) T } -log|fi|-plog(27r)} 

R(f,Q) = i{tr[Q^(/)]-log|Q|-plog(27r)}. 
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Theorem 5.7. Suppose that p < e nt for some £ < 1, and define the classes 
M. n = j/ : / is monotone with ||/||oo < C A/log n| 

c n = : Hir 1 !^^}. 

Let Q n be given by 

fi n = argminhr (siS n (f)) - log . 

Then 



i?(/ n , n„) - inf R(f, Q) = Op[ lJ 1 ^ 

Hence the Winsorized estimator of (/, Jl) with <5„ = l/(4n 1//4 v / 7r logn) is persistent over C n 
when L n = o (n^^^/y/Tognj. 

The proofs of Theorems 5.1 and 5.7 are given in Section 7. 



VI. Experimental Results 

In this section, we report experimental results on synthetic and real datasets. We mainly 
compare the ^-regularized nonparanormal and Gaussian (paranormal) models, computed 
using the graphical lasso algorithm (glasso) of Friedman et al. (2007). The primary conclu- 
sions are: (i) When the data are multivariate Gaussian, the performance of the two methods 
is comparable; (ii) when the model is correct, the nonparanormal performs much better than 
the graphical lasso in many cases; (hi) even for distributions that are not nonparanormal, 
the new method often performs better; (iv) for gene microarray data, our method behaves 
differently from the graphical lasso, and may support different biological conclusions. 

Note that we can reuse the glasso implementation to fit a sparse nonparanormal. In partic- 
ular, after computing the Winsorized sample covariance S n (f), we pass this matrix to the 
glasso routine to carry out the optimization 

h n = argmin jtr (ttS n (f)^ - log |0| + A n ||Q||i| . 



A. Neighborhood graphs 

We begin by describing a procedure to generate graphs as in (Meinshausen and Buhlmann, 
2006), with respect to which several distributions can then be defined. We generate a p- 
dimensional sparse graph G = (V,E) as follows: Let V = {!,..., p} corresponding to 
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variables X = (Xi, . . . ,X P ). We associate each index j with a point {Y^\Y^) G [0, l] 2 
where 

Yi (fc) ,...,Y n (fc) ~ Uniform [0, 1] 
for k = 1, 2. Each pair of nodes is included in the edge set E with probability 

where = (y^y^) is the observation of (Y^\y^) and || • || n represents the Euclidean 
distance. Here, s = 0.125 is a parameter that controls the sparsity level of the generated 
graph. We restrict the maximum degree of the graph to be four and build the inverse 
covariance matrix Q according to 

f 1 if % = j 

n (i,j) = l 0.245 ii(i,j)eE (6.1) 
^ otherwise, 

where the value 0.245 guarantees positive definiteness of the inverse covariance matrix. 
Given fl , n data points are sampled from 

xW,...,xW~NPN(no,?: J ) (6.2) 

where /io = (1.5, . . . , 1.5), S = fig 1 . For simplicity, the transformations functions for all 
dimensions are the same fi — ■ ■ ■ — f p — fo- To sample data from the nonparanormal 
distribution, we also need g = f^ 1 , two different transformations g are employed: 

Next, we define the following two transformation families: 



Definition 6.1. (Gaussian CDF Transformation) Let g be a one- dimensional Gaussian 
cumulative distribution function with mean /i go and the standard deviation a go , i.e., 



90 



9o(t) = $ 

We define the transformation function gj = fj l for the j-th dimension as 

( . . \ 

9 Ah) = °i ) '' === + ^ 



\ 



9o(y)- g»{t)<t>(^f)dt) <t>(^f)dy 



J 



where <jj = S (j,j). 
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Definition 6.2. (Symmetric Power Transformation) Let g be the symmetric and odd trans- 
formation given by 



9o(t) 



sign(t)|t 



a 



where a > is a parameter. We define the power transformation for the j-th dimension as 



These transformation are constructed to preserve the marginal mean and standard deviation. 
In the following experiments, we refer to them as the cdf transformation and the power 
transformation, respectively. For the cdf transformation, we set ji go = 0.05 and a go = 0.4. 
For the power transformation, we set a = 3. 

To visualize these two transformations, we sample 5000 data points from a one-dimensional 
normal distribution iV(0.5, 1.0) and then apply the above two transformations; the results are 
shown in Figure 3. It can be seen how the cdf and power transformations map a univariate 
normal distribution into a highly skewed and a bi-modal distribution, respectively. 

To generate synthetic data, we set p = 40, resulting in ( 4 2 °) + 40 = 820 parameters to be 
estimated, and vary the sample sizes from n = 200 to n — 1000. Three conditions are consid- 
ered, corresponding to using the cdf transform, the power transform, or no transformation. 
In each case, both the glasso and the nonparanormal are applied to estimate the graph. 

A.l Comparison of regularization paths 

We choose a set of regularization parameters A; for each A G A, we obtain an estimate 
Q n which is a 40 x 40 matrix. The upper triangular matrix has 780 parameters; we can 
vectorize it to get a 780-dimensional parameter vector. A regularization path is trace of these 
parameters over all the regularization parameters within A. The regularization paths for both 
methods are plotted in Figure 4. For the cdf transformation and the power transformation, 
the nonparanormal separates the relevant and the irrelevant dimensions very well. For the 
glasso, relevant variables are mixed with irrelevant variables. If no transformation is applied, 
the paths for both methods are almost the same. 

A. 2 Estimated transformations 

For sample size n = 1000, we plot the estimated transformations for three of the variables in 
Figure 5. It is clear that Winsorization plays a significant role for the power transformation. 
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Figure 3: The power and cdf transformations. The densities are estimated using kernel density 
estimator with bandwidths selected by unbiased cross-validation. 



This is intuitive due to the high skewness of the nonparanormal distribution resulting from 
the power transformations. 

A. 3 Quantitative comparison 

To evaluate the performance for structure estimation quantitatively, we use false positive 
and false negative rates. Let G = (V, E) be a p-dimensional graph (which has at most (|) 
edges) in which there are \E\ = r edges, and let G x = (V,E X ) be an estimated graph using 
the regularization parameter A. The number of false positives at A is 

FP(A) = number of edges in E x not in E 

The number of false negatives at A is defined as 

FN(A) = number of edges in E not in E x . 
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0.0 0.1 0.2 0.3 0.4 0.5 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.0 0.1 0.2 0.3 0.4 



n = 200 

Figure 4: Regularization paths for the glasso and nonparanormal with n = 500 (top) and n = 200 
(bottom). The paths for the relevant variables (nonzero inverse covariance entries) are plotted as 
solid (black) lines; the paths for the irrelevant variables are plotted as dashed (red) lines. 
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Figure 5: Estimated transformations for the first three variables. 



The oracle regularization level A* is then 

A* = argmin {FP(A) + FN(A)} . 

AeA 

The oracle score is FP(A*) + FN(A*). Figure 6 shows boxplots of the oracle scores for the 
two methods, calculated using 100 simulations. 

To illustrate the overall performance of these two methods over the full paths, ROC curves 
are shown in Figure 7, using 




FN(A) FP(A) \ 
r ' Q-r)- 
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Figure 6: Boxplots of the oracle scores for n = 1000,500,200 (top, center, bottom). 

The curves clearly show how the performance of both methods improves with sample size, 
and that the nonparanormal is superior to the Gaussian model in most cases. 

Let FPE = FP(A*) and FNE = FN(A*), Tables 1, 2, and 3 provide numerical comparisons of 
both methods on datasets with different transformations, where we repeat the experiments 
100 times and report the average FPE and FNE values with the corresponding standard 
deviations. It's clear from the tables that the nonparanormal achieves significantly smaller 
errors than the glasso if the true distribution of the data is not multivariate Gaussian and 
achieves comparable performance as the glasso when the true distribution is exactly multi- 
variate Gaussian. 
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Figure 7: ROC curves for sample sizes n = 1000,500,200 (top, middle, bottom). 
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Table 1: Quantitative comparison on the dataset using the cdf transformation 



Nonparanormal glasso 



n 


FPE 


(sd(FPE)) 


FNE 


(sd(FNE)) 


FPE 


(sd(FPE)) 


FNE 


(sd(FNE)) 


1000 


0.10 


(0.3333) 


0.05 


(0.2190) 


3.73 


(2.3904) 


7.24 


(3.2910) 


900 


0.18 


(0.5389) 


0.16 


(0.4197) 


3.31 


(2.4358) 


8.94 


(3.2808) 


800 


0.16 


(0.5069) 


0.23 


(0.5659) 


3.80 


(2.9439) 


9.91 


(3.4789) 


700 


0.26 


(0.6295) 


0.43 


(0.7420) 


3.45 


(2.5519) 


12.26 


(3.5862) 


600 


0.33 


(0.6039) 


0.41 


(0.6371) 


3.31 


(2.8804) 


14.25 


(4.0735) 


500 


0.58 


(0.9658) 


1.10 


(1.0396) 


3.18 


(2.9211) 


17.54 


(4.4368) 


400 


0.71 


(1.0569) 


1.52 


(1.2016) 


1.58 


(2.3535) 


21.18 


(4.9855) 


300 


1.37 


(1.4470) 


2.97 


(2.0123) 


0.67 


(1.6940) 


23.14 


(5.0232) 


200 


2.03 


(1.9356) 


7.13 


(3.4514) 


0.01 


(0.1000) 


24.03 


(4.9816) 



Table 2: Quantitative comparison on the dataset using the power transformation 



Nonparanormal glasso 



n 


FPE 


(sd(FPE)) 


FNE 


(sd(FNE)) 


FPE 


(sd(FPE)) 


FNE 


(sd(FNE)) 


1000 


0.27 


(0.7086) 


0.35 


(0.6571) 


2.89 


(1.9482) 


4.97 


(2.9213) 


900 


0.38 


(0.6783) 


0.41 


(0.6210) 


2.98 


(2.3697) 


5.99 


(3.0467) 


800 


0.25 


(0.5751) 


0.73 


(0.8270) 


4.10 


(2.7834) 


6.39 


(3.3571) 


700 


0.69 


(0.9067) 


0.90 


(1.0200) 


4.42 


(2.8891) 


8.80 


(3.9848) 


600 


0.92 


(1.2282) 


1.59 


(1.5314) 


4.64 


(3.3830) 


10.58 


(4.2168) 


500 


1.17 


(1.3413) 


2.56 


(2.3325) 


4.00 


(2.9644) 


13.09 


(4.4903) 


400 


1.88 


(1.6470) 


4.97 


(2.7687) 


3.14 


(3.4699) 


17.87 


(4.7750) 


300 


2.97 


(2.4181) 


7.85 


(3.5572) 


1.36 


(2.3805) 


21.24 


(4.7505) 


200 


2.82 


(2.6184) 


14.53 


(4.3378) 


0.37 


(0.9914) 


24.01 


(5.0940) 
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Table 3: Quantitative comparison on the dataset without any transformation 







Nonparanormal 






glasso 




n 


FPE 


(sd(FPE)) 


FNE 


(sd(FNE)) 


FPE 


(sd(FPE)) 


FNE 


(sd(FNE)) 


1000 


0.10 


(0.3333) 


0.05 


(0.2190) 


0.09 


(0.3208) 


0.06 


(0.2386) 


900 


0.24 


(0.7537) 


0.14 


(0.4025) 


0.22 


(0.6447) 


0.15 


(0.4113) 


800 


0.17 


(0.4277) 


0.16 


(0.3949) 


0.16 


(0.4431) 


0.19 


(0.4191) 


700 


0.25 


(0.6871) 


0.33 


(0.8534) 


0.29 


(0.8201) 


0.27 


(0.7501) 


600 


0.37 


(0.7740) 


0.36 


(0.7456) 


0.36 


(0.7722) 


0.37 


(0.6459) 


500 


0.28 


(0.5874) 


0.46 


(0.7442) 


0.25 


(0.5573) 


0.45 


(0.6571) 


400 


0.55 


(0.8453) 


1.37 


(1.2605) 


0.47 


(0.7713) 


1.35 


(1.2502) 


300 


1.24 


(1.3715) 


3.07 


(1.7306) 


0.98 


(1.2058) 


3.04 


(1.8905) 


200 


1.62 


(1.7219) 


5.89 


(2.7373) 


1.55 


(1.6779) 


5.62 


(2.6620) 



A. 4 Visualization of typical runs 

Figure 8 shows typical runs for the cdf and power transformations. It's clear that when the 
glasso estimates the graph incorrectly, the mistakes include both false positives and negatives. 

B. Gene microarray data 

In this study, we consider a dataset based on Affymetrix GeneChip microarrays for the 
plant Arabidopsis thaliana, (Wille, 2004). The sample size is n = 118. The expression 
levels for each chip are pre-processed by log-transformation and standardization. A subset 
of 40 genes from the isoprenoid pathway are chosen, and we study the associations among 
them using both the paranormal and nonparanormal models. Even though these data are 
generally treated as multivariate Gaussian in the previous analysis (Wille, 2004), our study 
shows that the results of the nonparanormal and the glasso are very different over a wide 
range of regularization parameters. This suggests the nonparanormal could support different 
scientific conclusions. 

B. 1 Comparison of the regularization paths 

We first compare the regularization paths of the two methods, in Figure 9. To generate 
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Figure 8: Typical runs for the two methods for n = 1000 using the cdf and power transformations. 
The dashed (black) lines in the symmetric difference plots indicate edges found by the glasso but 
not the nonparanormal, and vice-versa for the solid (red) lines. 
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the paths, we select 50 regularization parameters on an evenly spaced grid in the interval 
[0.16,1.2]. Although the paths for the two methods look similar, there are some subtle 
differences. In particular, variables become nonzero in a different order, especially when 
the regularization parameter is in the range A G [0.2,0.3]. As shown below, these subtle 
differences in the paths lead to different model selection behaviors. 



glasso path nonparanormal path 




0.2 0.3 0.4 0.5 0.6 0.7 0.2 0.3 0.4 0.5 0.6 0.7 

lambda lambda 

Figure 9: The regularization paths of both methods on the microarray dataset. 



B.2 Comparison of the selected graphs 

Figure 10 compares the estimated graphs for the two methods at several values of the reg- 
ularization parameter A in the range [0.16,0.37]. For each A, we show the estimated graph 
from the nonparanormal in the first column. In the second column we show the graph ob- 
tained by scanning the full regularization path of the glasso fit and finding the graph having 
the smallest symmetric difference with the nonparanormal graph. The symmetric difference 
graph is shown in in the third column. The closest glasso fit is different, with edges selected 
by the glasso not selected by the nonparanormal, and vice-versa. Several estimated trans- 
formations are plotted in Figure 11, which are are nonlinear. Interestingly, several of the 
differences between the fitted graphs are related to these variables. 
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Figure 10: The nonparanormal estimated graph for three values of A = 0.2448,0.2661,0.30857 
(left column), the glasso estimated graph (middle) and the symmetric difference graph (right). 




Figure 11: Estimated transformations for the microarray dataset, indicating non-Gaussian 
marginals. The corresponding genes are among the nodes appearing in the symmetric difference 
graphs of Figure 10. 
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VII. Proofs 



We assume, without loss of generality from Lemma 3.3, that pij = and aj = 1 for all 
j — 1, . . . ,p. Thus, define fj(x) = ^~ 1 (F j (x)) and = ^~ 1 (F j (x)), and let ^ = 

A Proof of Theorem 5.1 

We start with some useful lemmas; the first is from Abramovich et al. (2006). 

Lemma 7.1. (Gaussian Distribution function vs. Quantile function) Let $ and denote the 
distribution and density functions of a standard Gaussian random variable. Then 

M < 1 - < ^ if t > 1 (7.1) 

2t ~ w ~ t ~ y ' 

and 

(*-')'<") = < 7 ' 2 > 



Also, for n > 0.99, we have 



^-\n) = ^2 log - rfa) (7.3) 



where r(n) G [0, 1.5]. 



Lemma 7.2. (Distribution function of the transformed random variable) For any a G (—00,00) 

Proof. The statement follows from 

F 3 (t) = F(X 3 <t) = <t)= < g-\t)) = $ (gj\t)) . (7.4) 

which holds for any t. □ 

Lemma 7.3. (Gaussian maximal inequality) Let Wi, . . . , W n be independently and identi- 
cally distributed standard Gaussian random variables. Then for any a > 



P f max Wi > J a logn ] < — \- -. 

\i<i<n v J ~ n a/2_1 
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Proof. Using Mill's inequality, we have 

P ( max Wi > xja \ogn\ < ^ P (w t > xj a logn) 

V - % - n J i= i 



< n 



1 



logn) _ 
y/ot log n n a / 2 ~ 1 ^2ira logn' 



from which the result follows. 



□ 



Lemma 7.4. For any a > that satisfies 1 — 5 n — $ ( v / «Togn) > for all n, we have 



P 



arid 



Fj (gj (Valogn)) > 1 - 5 n < exp |-2n (l - 5 n - $ logn)) 2 | . (7.5) 
*j (& (—s/alogra)) < 5 n < exp j-2ra (l - 5„ - $ (v 7 " 1 ^^)) 2 } • (7-6) 



Proof. Using Hoeffding's inequality, 



P 



Fj (gj (yja logn) ) > 1-5 
= P 



Fj- [y/a logn)) - F^ (Valogn)) > 1 - <5„ - Fj (Valogn)) 
< exp j-2n - 5 n - F,- (Vodogn))) J . 

Equation (7.5) then follows from equation (7.4). The proof of equation (7.6) uses the same 
argument. □ 

Now let M > 2 be some constant and set (3 — ^. We split the interval 

9j ( - x/Mlogn) , ^ ( ^Mlogn)] 

into two parts, the middle 

M„ = [gj (-y/plogn) ,gj (V/31ogn)) 

and ends 



f = 



9j (-VMlogn^J ,gj (-y/filogn) U g i ^ p log nj ,g t (y/ Mlogn 



The behaviors of the function estimates in these two regions is different, and so we first 
establish bounds on the probability that a sample can fall in the end region £ n . 
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Lemma 7.5. Let A = ^(vHm - y/fi). Then 



Proof. Using equation (7.4) and the mean value theorem, we have 



= P(x y E [ 9j ( y/P log n) , g 3 ( log n)] ) + P (x y e [ 5j ( - log n) , ^ ( - ^ log n)] ) 

= F, (^(VMlogn)) - Fj (^(V/^logn)) + F, log n)) - F, (^(-VMlogn)) 

= 2 ($(v/Mlogn) - $(V^logn)) 

< 20 ^v^lognj ^Mlogn - \Jl3\ogn} . 

The result of the lemma follows directly. □ 

We next bound the error of the Winsorized estimate of a component function over the end 
region. 



Lemma 7.6. For all n, we have 



sup 



$- 1 (F,(t))-$- 1 (F,(t)) < v/2(M + 2)logn, Vj G {1, . . . ,p}. 



Proof. From Lemma 7.2 and the definition of £ n , we have 



sup (Fj(t)) \ £ 0, yMlogn . 



Given the fact that 5 n = — — — ^=^= 

4n 1 / 4 v /7r logn 



, we have FAt) e ( — , 1 ). Therefore, from 

( n n ' 



equation (7.3), 



sup 

tee n 



(^-(0)1 e [0, \j2\ogn 



The result follows from the triangle inequality and \f~M + y/2 < \J2{M + 2). 



□ 
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Now for any e > 0, we have 



P ( max 



Sn(f)jk — S n (f) 



> e 



P I max 



< P I max 



fj( X ij)fk(X ik ) - fj(X i: j)f k (X ik ) - II„(fj)fJ,n(fk) + Hn{fj)Hn{fk) } 

i=l 

\ n ~ ~ 



> e 



+ P max //„(/j)//„(/fc) ~ 



We only need to analyze the rate for the first term above, since the second one is of higher 
order (Cai et al, 2008). Let 

A^k) = lixSkiXik) - fj{Xij)fk{X ik ) 

and 

Q t Mk) = mJk{s)-mfk{s). 

We define the event A n as 

A n = | gj (-VMlogn) < X lv ...,X nj < g 3 (y/Mlogn) , j = 1, . . . ,p} . 

Then 

¥(A c n ) < pF (mm fj (X tJ ) < -VMlogn) + pF (max > y/Mlogn^ < -g^. 

with ci as a generic positive constant. Therefore 
1 n 



P I max 



i=l 



max 

V j,* 


i=i 




+ HA c n ) 


max 


1 n 

i=i 


> e,A n 


n M/2-l 



Thus, we only need to carry out our analysis on the event A n . On this event, we have the 
following decomposition: 



P I max 



> 6, A n 



< P 



E i*w.*)i>3)+ p hs4 E ia«w,*)i>j 

2P max- ^ |A,(j,A;)|>M 

\ Xij£Mn,X ik ££„ I 
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We now analyze each of these terms separately. 



lose v lose ft 

Lemma 7.7. On the event A n , let (3 = 1/2 and e > C(M, g)\ 6F 1/9 6 — , then 



( 1 V 

I max — £ 



P I max- V \Mj,k)\ > - A I =o(l). 



4 



Proof. We define 

0i 



with the same parameter A as in Lemma 7.5. Such a #i guarantees that 



HI. _ n AJ l0Sn - nAJ l ° gn > 



By Lemma 7.5, we have 



P ^ E 1 {x ij e£n,x ilt e£„} > ^-J < P ^E l{x l3 e£„} > ^ 



< P ( E ( W*.} - P (Xy G > ^ - ^ /l0gn 



40i V 



Using the Bernstein's inequality, for /? = -, 



< exp 

where Ci, c 2 , c 3 > are generic constants. 



c x n 2 ^logn 



c 2 n 1_/3 / 2 v / logn + C3n 1_ ^/ 2 \/log^ 
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Therefore, 



I max I E > | 

= P j max — J] |Ai(j, fc)| > ^,max sup \Q t ,s{j^ k )\ > i ) 

I j,k TL 4 j,k t&£„,s&£„ I 



< P ( max sup 

•?> fe te£ n ,se£: 



= P(max sup |0 M (j,fc)| > 0i ) +o(l). 



Now, we analyze the first term 

pfmax sup \e t>s (j,k)\>9 1 ) < p 2 F ( sup \G t , 8 (j, k)\ > 6 1 

\ 3> k t££ n ,se£ n J \te£n,s££n 

= P 2 f( sup \f j (t)f k (s)-f j (t)f k (s)\>9 



n 

By adding and subtracting terms fj(t) and f s (t), we have 



p( sup !£(*)/*(*) >*i 



< Pf sup |(/,(t)-/,(t))(/ fc ( S )-/ fc ( S ))|>|) 
+ P( sup |(^)-/i(0)l'l/*(«)l>^ 

+ pf ^p |(/ fc ( a ) >|Y 



The first term can further be decomposed to be 



pf sup !(/,-(*) - /,-(*))(/*(*) - /*(*))! > 
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Also, from the definition of £ n , we have 



sup \j)(t)\ = sup \g-\t)\ < VMlog 



n. 



Since r > : \; '''^''T " • we have 



* = ^> C(M ^ ) V / ^^^2(M + 2)lo g n. 
3 244v/Iogn - 24Av^og^ v ; 6 



This implies that 

> v/2(M + 2)logn and > ^(Af + 2) logn. 

V 3 6\JM\ogn 



Then, from Lemma 7.6, we get 



p(sup|(/,(0-/,W)I>a/t) = ° 



and 



Pf sup \( fj (t) - f 3 (t))\ ■ \f k (s)\ > 6 -±) =0. 
The claim of the lemma then follows directly. □ 

Remark 7.8. From the above analysis, we see that the data in the tails doesn't affect the 
rate. Using exactly the same argument, we can also show that 

P max- \^i(jM>- A )=o(l)- 

I 3,k n z — ' 4 / 



log p log 2 n 



Lemma 7.9. On the event A n , let (3 = 1/2 and e > C(M,£)\\ ^ — . There exist 



n 



generic constants Ci, c 2 , c 3 , c 4 , such that 

1 , . / . , m e \ ( cin^^e 2 \ ( c 4 n 1 ~ /3 



max 



V \Ai(j, k)\ > - ) < c 2 exp f~ , 2 ) + c 3 exp f- 
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Proof. We have 



? I max 



«- E 1^0',*)! > 7 ] <^ F ( S ^P l/i(*)/fc(a)"/i(*)/fc(s)l> 
fc 71 Xij eM n ,x ik eM n V 



< P 2 F ( sup - £(*))(/*(*) - fk(s))\ > 

n 



+ 2/pf sup \(fj(t) - fj(t))\ ■ \f k (s)\ > . 

n / 



Further, since 



sup \fj(t)\ = sup \ 9j \t)\ = ^f3\ogn 

tGMn teM n 



and sup \(fj(t) - fj(t))(f k (s) - f k (s))\ is of higher order than sup teMn>seMn | (/_,■(*) 
teM„,seM n 

fj(t))\-\fh(s)\, we only need to analyze the term P (sup teMn - fj{t))\ > ^-^== 

Since 5 n = — , , using Mill's inequality we have 

4nP' 2 ^/27rf3 logn 

2*„ = f^<l-*(^l^). 
2y/p logn 



This implies that 



l-6 n - $(7/3 logn) > 5 n > 0. 

Using Lemma 7.4, we have 



2 \ / n l-0 



p2F (* (V^)) > 1 - *0 * «P (-afej - ^ Ug^logn) ) (7 ' 7) 
and 

" 2P ($ (* (-v^)) < *.) S exp (- logp( ^ logK) ) ■ (7.8) 

Define an event B n as 

B n = {d n < Fj (g 3 (V/51ogn)) < 1 - <f n , j = 1, . . . ,p} . 
From (7.7) and (7.8), it is easy to see that 



PK) < c 3 exp 



logp(logn) 
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where c 3 and c 4 are generic positive constants. 
From the definition of Fj, we have 

p 2 p(sup \{f j (t)-f j (t))\> 



teM, 

< p 2 F[ sup 

yteMn 



12v^logn 
S" 1 (m)-^ (Fjit)) 



> 



< P 2 



sup 

teM„ 



Q- 1 fatj) - Q- 1 {F&)) 



> 



e 



B n )+F(B n ). 



12V/? log n 



+ c 3 exp 



1-13 



logp(logn) 



Define 



T Vn = max |Fj (V/^logn)) , 1 - <J n | and T 2n = 1 - min |Fj (ft (-A//31ogn)) , 5 n 
From equation (7.4) and the fact that 1 — 5 n > $ (V /3 log n) , we have that 

T\ n = T 2n — 1 — 8 n . 
Thus, by the mean value theorem, 



P( sup 

yteMn 



ar 1 (j^)-^ 1 (*>(*)) 



> 



12v/#logn 



< P ( (max {T ln , T 2n }) sup F,(t) - Fj(t) 

teM n 



> 



l2y/(3 \ogn 



= P ($-7(1-5,,) sup F,(t)-F 3 (t) 

teM 



> 



12y/f3\ogn 



Finally, using the Dvoretzky-Kiefer-Wolfowitz inequality, 



P( sup 

K teM n 



q- 1 fat)) - q- 1 m)) 



> 



12^/3 log n 



< P ( sup 

,t£M„ 



> 



($-!)' (1 - 5 n ) \2^Xogn 



< 2 exp -2 



ne 



144/3 log n [($~7 (l-5 n ) 



Furthermore, by Lemma 7.1, 

(*-7(l-<5n) ! 



^ ($-i(l-*„)) 



< 



21og- 



n. 
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This implies that 



p P ( sup 

\teM n 



> < 2 exp 



12y/ (3\ogn J V logplog 2 n 



cxn 1 ^e 2 



where Ci is a generic constant. 
In summary, we have 
( 



P 



max- > Ai(j,fc)>- <c 2 exp — — +c 3 exp 



logp(logn) 

where ci, c 2 , c 3 , c 4 are generic constants. □ 
The conclusion of Theorem 5.1 follows from Lemma 7.7 and Lemma 7.9. 

B. Proof of Theorem 5. 7 

Proof. First note that the population and sample risks are 

R(f,Q) = ±{tr[QE(f(X)f(X) T ] -log|Q|-plog(27r)} 
R(f,Sl) = i{tr[^„(/)]-log|Q|-plog(27r)}. 
Therefore, for all (/, Q) G M p n © C n , we have 

= ^|tr[fi(E[// T ]-S n (/))]| 

< i H^l^ max sup \E{fj(Xj)f k (X k ) - S n (f) jk \ 

— max sup 

2 i k fi,f k eM„ 

Now, if T is a class of functions, we have 

Efsupl^-M^l) < C7J[l( " J ?°°" F) (7-9) 

for some C > 0, where = sup 9gci? |<7(x)|, /i(g) = E(g(X)) and /2(g) = n" 1 Y17=i di-X-i) 

(see Corollary 19.35 of van der Vaart (1998)). Here the bracketing integral is defined to be 

r-<5 



< ^max sup |E(/j(Xj)/ fc (X fc ) - S n (f) jk \. 



J U (8,F) = J yjlog N u (u,F)du 
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where log iV[](e, T) is the bracketing entropy. For the class of one dimensional, bounded and 
monotone functions, the bracketing entropy satisfies 



log N u (e,M) < K 
for some K > (van der Vaart and Wellner, 1996). 



Now, let V n ,p be the class of all functions of the form m(x) = fj{xj)fk{xk) for j, k G 
{1, . . . where fj G M. n for each j. Then the bracketing entropy satisfies 



\ogN {] {C^f^i,V n , p ) < 2\og V + K (- 



and the bracketing integral satisfies J[](C y/logn, V n , P ) = O^lognlogp). It follows from 
(7.9) and Markov's inequality that 



max sup \S n (f) jk -E(f j (X j )f k (X k )\ = P [ W !2l!i!2i£ ) = 0p [ \ ^ 
Therefore, 



sup w^-W^HOp 



L nV /Iogn 



As a consequence, we have 

#(/*,fi*) < R(f n ,n n ) 

'L ny /\ogn 



< i?(/ n ,Q n )+C P 

< %*,fi*) + P 

< i2(/*,n*) + P 



n (l-0/2 



L n yiogn 

n (l-0/2 



L nV /logn 



n (l-0/2 

and the conclusion follows. □ 



VIII. Concluding Remarks 

In this paper we have introduced the nonparanormal, a type of Gaussian copula with non- 
parametric marginals that is suitable for estimating high dimensional undirected graphs. 
The nonparanormal can be viewed as an extension of sparse additive models to the setting 
of graphical models. We proposed an estimator for the component functions that is based on 
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thresholding the tails of the empirical distribution function at appropriate levels. A theoreti- 
cal analysis was given to bound the difference between the sample covariance with respect to 
these estimated functions and the true sample covariance. This analysis was leveraged with 
the recent work of Ravikumar et al. (2009) and Rothman et al. (2008) to obtain consistency 
results for the nonparanormal. Computationally, fitting a high dimensional nonparanormal 
is no more difficult than estimating a multivariate Gaussian, and indeed one can exploit 
existing software for the graphical lasso. Our experimental results indicate that the sparse 
nonparanormal can give very different results than a sparse Gaussian graphical model, sug- 
gesting that it may be a useful tool for relaxing the normality assumption, which is often 
made only for convenience. 
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